1. Introduction to Open Data Science 2018

This course provides useful basic knowledge of open-source data science workflows.

Link to my GitHub repository:
https://github.com/MikkoPatronen/IODS-project


2. Regression and model validation

1. The structure and dimensions of the data
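The structure output below was presumably produced along the following lines. The data frame name learning2014 is taken from the model call later in this chapter; the file name is an assumption.

```r
# read the wrangled data and inspect its structure and dimensions
learning2014 <- read.table("learning2014.txt", header = TRUE)
str(learning2014)
dim(learning2014)
```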

## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data is a subset of the variables in the “International survey of Approaches to Learning” data. It consists of 166 rows (observations) and 7 columns (variables): “gender”, “age”, “attitude”, “deep”, “stra”, “surf” and “points”. The variables “gender” (M = male, F = female) and “age” (in years) are self-explanatory. The variable “attitude” describes global attitude toward statistics (a sum of ten variables). The variable “deep” describes the deep learning approach (a mean of three variables), “stra” the strategic approach (a mean of two variables) and “surf” the surface approach (a mean of three variables). The variable “points” holds the exam points.

2. A graphical overview and summaries of variables in the data

##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :32.00   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :31.43   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :50.00   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

There are almost twice as many answers from females (n = 110) as from males (n = 56). Participants' ages range from 17 to 55 years, and half of them are at least 22 years old.

3. Regression model

I chose the variables “attitude”, “stra” and “surf” as explanatory variables for the target variable “points”. Here is the summary for the fitted model:
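The model summarised below can be reproduced with `lm()` (variable and data names as in the Call; the object name my_model is arbitrary):

```r
# fit a multiple regression model with three explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model)
```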

## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

According to the model, the variables “stra” and “surf” do not have a statistically significant relationship with the target (p-values 0.117 and 0.466, respectively). Since the variable “surf” has the larger p-value, let us remove it and fit a new model without it. Here is the summary of the new model:

## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.97290    2.39591   3.745  0.00025 ***
## attitude     0.34658    0.05652   6.132 6.31e-09 ***
## stra         0.91365    0.53447   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

Now the model summary shows that the variable “stra” does not have a statistically significant relationship with the target either (p-value 0.089), so I create a new model without it. Here is the final model summary:

## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

This model assumes that attitude towards statistics has a linear relationship with exam points. The explanatory variable “attitude” has a statistically significant relationship with the target variable “points” (p-value close to 0).

4. Interpretation of the model parameters

The model shows that the y-intercept is 11.637 and the slope for attitude is 0.353, which means that when attitude goes up by one point, the predicted exam points rise by 0.353. This seems reasonable.

Multiple R-squared can be interpreted as the proportion of variance in the target variable explained by the explanatory variable(s). In this model, attitude explains about 19 percent of the variance in exam points.
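As a sketch of this interpretation, the R-squared value and a predicted score can be pulled out of the fitted object (model and data names are assumptions, following the Call above):

```r
final_model <- lm(points ~ attitude, data = learning2014)

# proportion of variance explained, about 0.19
summary(final_model)$r.squared

# predicted exam points for a student with attitude 30:
# 11.637 + 0.353 * 30 is roughly 22.2
predict(final_model, newdata = data.frame(attitude = 30))
```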

5. Model diagnostics

The model assumes that the residuals are normally distributed. A residual is the difference between the observed and the predicted value at a given point.
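This assumption can be checked graphically with R's built-in diagnostic plots, for example (the model object name final_model is an assumption):

```r
# residuals vs fitted values, normal Q-Q plot and residuals vs leverage
par(mfrow = c(2, 2))
plot(final_model, which = c(1, 2, 5))
```

A roughly straight Q-Q line supports the normality assumption, and the residuals vs fitted plot should show no clear pattern.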


3. Logistic regression

The data consists of two datasets retrieved from the UCI Machine Learning Repository (the Student Performance dataset, https://archive.ics.uci.edu/ml/datasets/Student+Performance). The attributes include student grades and demographic, social and school-related features, collected using school reports and questionnaires. The two datasets cover performance in two distinct subjects: one contains students from a Mathematics course (mat) and the other students from a Portuguese language course (por). The datasets were combined for this analysis.

The data consists of 382 observations (students) and 35 variables. The 35 column names are as follows:

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

Two of the variables describe the student's alcohol use. The variable “alc_use” is the average alcohol use on weekdays and weekends combined, and the variable “high_use” indicates whether the average alcohol use is greater than 2.
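A sketch of how these two variables could have been created in the data wrangling step; the exact wrangling code is not shown here, so this is an assumption based on the variable descriptions (Dalc and Walc are the weekday and weekend alcohol use columns from the list above):

```r
library(dplyr)

# average of weekday and weekend alcohol use, and a logical high-use flag
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
alc <- mutate(alc, high_use = alc_use > 2)
```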

I chose 4 variables that I believe to have some relationship with alcohol consumption: “sex”, “studytime”, “activities” and “romantic”. Gender (“sex”), extra-curricular activities (“activities”) and relationship status (“romantic”) are binary variables. Weekly study time (“studytime”) is a numeric variable with four categories, ranging from less than 2 hours to more than 10 hours per week. I hypothesize that males, those without extra-curricular activities, those not in a romantic relationship, and those with less weekly study time have a higher probability of high alcohol consumption.

Tables and numerical evaluation of the relationships

High alcohol use vs. gender

I think a barplot describes the relationships quite well. The barplot below shows the distribution of alcohol consumption by gender. It can clearly be seen from the plot that males form the majority at the higher mean values of alcohol consumption, especially at mean values of 3 or higher. The group summary gives the numbers: there are 72 males and 42 females in the high alcohol use group. This supports our hypothesis.
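The barplot itself is not reproduced here, but it could be drawn roughly like this (the data frame name alc2 is taken from the glm call later in this chapter):

```r
library(ggplot2)

# counts of each mean alcohol use value, split by gender
ggplot(alc2, aes(x = alc_use, fill = sex)) +
  geom_bar(position = "dodge")
```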


## # A tibble: 4 x 3
## # Groups:   sex [?]
##   sex   high_use count
##   <fct> <lgl>    <int>
## 1 F     FALSE      156
## 2 F     TRUE        42
## 3 M     FALSE      112
## 4 M     TRUE        72

High alcohol use vs. extra-curricular activities

The next barplot shows the mean of alcohol consumption according to the binary variable “activities”, which describes whether or not the individual has extra-curricular activities. This plot is not as clear as the previous one, but those without extra-curricular activities form a slight majority at mean values above 2. The group summary shows 59 “no” and 55 “yes” answers in the high alcohol use group. This supports our hypothesis, but the difference is very small.

## # A tibble: 4 x 3
## # Groups:   activities [?]
##   activities high_use count
##   <fct>      <lgl>    <int>
## 1 no         FALSE      122
## 2 no         TRUE        59
## 3 yes        FALSE      146
## 4 yes        TRUE        55

High alcohol use vs. being in a romantic relationship

The next barplot shows the mean of alcohol consumption according to the binary variable “romantic”, which describes whether or not the individual is in a romantic relationship. This plot clearly supports our hypothesis: those who are not in a romantic relationship form the majority at mean values of 2 and higher. The group summary also clearly supports the hypothesis: in the high alcohol use group there are 81 individuals who are not in a romantic relationship but only 33 who are.

## # A tibble: 4 x 3
## # Groups:   romantic [?]
##   romantic high_use count
##   <fct>    <lgl>    <int>
## 1 no       FALSE      180
## 2 no       TRUE        81
## 3 yes      FALSE       88
## 4 yes      TRUE        33

High alcohol use vs. studytime

Finally, a boxplot also supports our hypothesis: those with less weekly study time form the majority of the high alcohol use group. The group summary likewise shows that as study time decreases, the number of individuals with high alcohol use increases.

## # A tibble: 8 x 3
## # Groups:   studytime [?]
##   studytime high_use count
##       <int> <lgl>    <int>
## 1         1 FALSE       58
## 2         1 TRUE        42
## 3         2 FALSE      135
## 4         2 TRUE        60
## 5         3 FALSE       52
## 6         3 TRUE         8
## 7         4 FALSE       23
## 8         4 TRUE         4

Logistic regression

Next I will conduct a logistic regression analysis to study the relationships between the independent variables (“sex”, “studytime”, “activities” and “romantic”) and the dependent variable (“high_use”: average alcohol use greater than 2). Here is the summary of the analysis:
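The model in the summary below corresponds to a call along these lines (names as in the Call; the object name m is arbitrary):

```r
# logistic regression with four explanatory variables
m <- glm(high_use ~ sex + romantic + activities + studytime,
         data = alc2, family = "binomial")
summary(m)
coef(m)
```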

## 
## Call:
## glm(formula = high_use ~ sex + romantic + activities + studytime, 
##     family = "binomial", data = alc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2083  -0.8872  -0.6806   1.1819   2.0591  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   -0.17051    0.38376  -0.444  0.65681   
## sexM           0.69839    0.24552   2.845  0.00445 **
## romanticyes   -0.08315    0.25272  -0.329  0.74213   
## activitiesyes -0.26321    0.23571  -1.117  0.26414   
## studytime     -0.45540    0.15754  -2.891  0.00384 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 439.32  on 377  degrees of freedom
## AIC: 449.32
## 
## Number of Fisher Scoring iterations: 4
##   (Intercept)          sexM   romanticyes activitiesyes     studytime 
##    -0.1705116     0.6983944    -0.0831546    -0.2632122    -0.4553951

From the summary we can see that gender and study time both have statistically significant coefficients (p-values 0.00445 and 0.00384, respectively). The coefficients for “romantic” and “activities” are estimated for the level “yes”, with “no” as the reference level set automatically by R. Since neither of them is statistically significant (p-values 0.742 and 0.264), I will drop these two variables and fit the model without them. So here is the new model:
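The reduced model and the odds ratios with their confidence intervals shown below can be produced along these lines (object names are arbitrary):

```r
# refit without the non-significant variables
m2 <- glm(high_use ~ sex + studytime, data = alc2, family = "binomial")
summary(m2)

# odds ratios and 95% confidence intervals
OR <- exp(coef(m2))
CI <- exp(confint(m2))
cbind(OR, CI)
```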

## 
## Call:
## glm(formula = high_use ~ sex + studytime, family = "binomial", 
##     data = alc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1429  -0.8820  -0.7177   1.2123   2.1421  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -0.2621     0.3726  -0.704  0.48169   
## sexM          0.6618     0.2413   2.743  0.00610 **
## studytime    -0.4815     0.1566  -3.075  0.00211 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 440.71  on 379  degrees of freedom
## AIC: 446.71
## 
## Number of Fisher Scoring iterations: 4
## (Intercept)        sexM   studytime 
##  -0.2621283   0.6618371  -0.4814874
## Waiting for profiling to be done...
##                    OR     2.5 %    97.5 %
## (Intercept) 0.7694123 0.3707640 1.6029554
## sexM        1.9383500 1.2105015 3.1228545
## studytime   0.6178637 0.4502538 0.8331592

This model also gives statistically significant estimates for gender and study time. The odds ratios can be interpreted as follows: males have 1.94 times the odds of being high users of alcohol (95% confidence interval 1.21 to 3.12), and an increase in study time is associated with lower odds of high alcohol use (OR 0.62, 95% confidence interval 0.45 to 0.83). Both interpretations support the original hypothesis: males are more likely to be high users of alcohol, and a low amount of study time is associated with higher alcohol use.
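As a quick sanity check, an odds ratio is simply the exponential of the corresponding coefficient:

```r
# exp of the sexM coefficient reproduces the odds ratio of about 1.94
exp(0.6618371)
```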


4. Clustering and classification

This week we are using a dataset (included in R's MASS package) called “Boston”, which contains variables for studying housing values in the suburbs of Boston. The variables are:

  • crim - per capita crime rate by town
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox - nitrogen oxides concentration (parts per 10 million)
  • rm - average number of rooms per dwelling
  • age - proportion of owner-occupied units built prior to 1940
  • dis - weighted mean of distances to five Boston employment centres
  • rad - index of accessibility to radial highways
  • tax - full-value property-tax rate per $10,000
  • ptratio - pupil-teacher ratio by town
  • black - 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
  • lstat - lower status of the population (percent)
  • medv - median value of owner-occupied homes in $1000s

This information comes from the Boston dataset documentation in the MASS package.

The structure and dimensions of the data are presented here:

data(Boston)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

As we can see, the dimensions of Boston data are 506 rows and 14 columns. Here is a summary of the variables:

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Now, let’s print correlation coefficients between variables and a graphical overview of those correlations in the data:

cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

In the correlation matrix plot the colors indicate the direction of the correlation (red is negative and blue is positive) and darker colors indicate stronger correlations. Based on the plot, one variable pair has a particularly strong positive correlation: “rad” and “tax” (index of accessibility to radial highways and full-value property-tax rate per $10,000), r = 0.91. The variable “dis” has strong negative correlations with “age” (r = -0.75), “nox” (r = -0.77) and “indus” (r = -0.71). There is also a strong negative correlation between “lstat” and “medv” (r = -0.74). The mentioned variables are:

  • dis - weighted mean of distances to five Boston employment centres
  • nox - nitrogen oxides concentration (parts per 10 million)
  • indus - proportion of non-retail business acres per town
  • lstat - lower status of the population (percent)
  • medv - median value of owner-occupied homes in $1000s

Let’s standardize the dataset and print out summaries of the scaled data:

# scale() returns a matrix, so convert it back to a data frame for the later steps
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

We can see that after standardizing the mean of each variable is zero.

# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, label = c("low", "med_low", "med_high", "high"), breaks = bins, include.lowest = TRUE)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2326733 0.2450495 0.2599010 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.9667397 -0.9048142 -0.08661679 -0.8841363  0.4182980
## med_low  -0.1200477 -0.2375617  0.06274329 -0.5527046 -0.1847747
## med_high -0.3815632  0.1759325  0.24466389  0.3594167  0.1659049
## high     -0.4872402  1.0170492 -0.12234430  1.0391104 -0.4700333
##                 age        dis        rad        tax     ptratio
## low      -0.9051405  0.8536728 -0.6839198 -0.7271865 -0.46620442
## med_low  -0.3211803  0.3073077 -0.5615811 -0.4277290 -0.02221155
## med_high  0.3981304 -0.3346304 -0.4099578 -0.3007148 -0.24727254
## high      0.8213257 -0.8610618  1.6388211  1.5145512  0.78158339
##                black       lstat        medv
## low       0.37173893 -0.75652658  0.50428356
## med_low   0.31116278 -0.07786545 -0.06868928
## med_high  0.07241503 -0.06921595  0.21521067
## high     -0.74650664  0.93571304 -0.71642081
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.10154812  0.69450191 -0.83101653
## indus    0.01113254 -0.29086989  0.19031801
## chas    -0.09058595 -0.04225589  0.16227843
## nox      0.42164075 -0.77586108 -1.31015820
## rm      -0.10292095 -0.09341653 -0.10493212
## age      0.23308805 -0.36306728 -0.20887145
## dis     -0.03544963 -0.33367468  0.02160288
## rad      3.17836534  0.94673592 -0.38234461
## tax     -0.02397136  0.01918269  0.82783519
## ptratio  0.10163743 -0.08284321 -0.17268339
## black   -0.12698479  0.05791888  0.14404636
## lstat    0.21316334 -0.17107222  0.45291571
## medv     0.16760263 -0.45910511 -0.21308796
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9480 0.0386 0.0134
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
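A natural next step, not shown in the output above, is to cross-tabulate the predicted classes against the true classes in the test set; this is a sketch using the objects created above:

```r
# cross-tabulate correct vs predicted crime categories
table(correct = test$crime, predicted = lda.pred$class)
```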

You can check out my R script for the Data wrangling part here.


5. Dimensionality reduction techniques

This week’s tasks present dimensionality reduction techniques. Principal Component Analysis (PCA) will be applied.

The R script file that created the data file can be viewed here. Let us view information about the data:

5.1 Summary of the variables in the data:

rm(list=ls()) 
library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(MASS)
library(corrplot)
human <- read.table("human.txt")
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  0.198 0.931 0.861 0.977 0.989 ...
##  $ Labo.FM  : num  0.199 0.685 0.211 0.633 0.747 ...
##  $ Edu.Exp  : num  9.3 11.8 14 17.9 12.3 20.2 15.7 11.9 12.6 14.4 ...
##  $ Life.Exp : num  60.4 77.8 74.8 76.3 74.7 82.4 81.4 70.8 75.4 76.6 ...
##  $ GNI      : int  1885 9943 13054 22050 8124 42261 43869 16428 21336 38599 ...
##  $ Mat.Mor  : int  400 21 89 69 29 6 4 26 37 22 ...
##  $ Ado.Birth: num  86.8 15.3 10 54.4 27.1 12.1 4.1 40 28.5 13.8 ...
##  $ Parli.F  : num  27.6 20.7 25.7 36.8 10.7 30.5 30.3 15.6 16.7 15 ...

The data includes 155 observations of 8 variables. They are:

  • Edu2.FM - the ratio of female and male populations with secondary education in each country
  • Labo.FM - the ratio of labour force participation of females and males in each country
  • Life.Exp - life expectancy at birth
  • Edu.Exp - Expected years of schooling (years)
  • GNI - The gross national income
  • Mat.Mor - The maternal mortality ratio
  • Ado.Birth - the adolescent birth rate
  • Parli.F - the share of parliamentary seats held by women

Graphical overview of the data

summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human)

Since the variables are not scaled, the means are quite different across variables.

The ggpairs() plot shows the distributions of the variables and the correlations between them. The first four variables (Edu2.FM, Labo.FM, Edu.Exp and Life.Exp) are negatively skewed and the remaining four are positively skewed.

The corrplot() visualizes correlations between variables:

cor(human)
##                Edu2.FM      Labo.FM     Edu.Exp   Life.Exp         GNI
## Edu2.FM    1.000000000  0.009564039  0.59325156  0.5760299  0.43030485
## Labo.FM    0.009564039  1.000000000  0.04732183 -0.1400125 -0.02173971
## Edu.Exp    0.593251562  0.047321827  1.00000000  0.7894392  0.62433940
## Life.Exp   0.576029853 -0.140012504  0.78943917  1.0000000  0.62666411
## GNI        0.430304846 -0.021739705  0.62433940  0.6266641  1.00000000
## Mat.Mor   -0.660931770  0.240461075 -0.73570257 -0.8571684 -0.49516234
## Ado.Birth -0.529418415  0.120158862 -0.70356489 -0.7291774 -0.55656208
## Parli.F    0.078635285  0.250232608  0.20608156  0.1700863  0.08920818
##              Mat.Mor  Ado.Birth     Parli.F
## Edu2.FM   -0.6609318 -0.5294184  0.07863528
## Labo.FM    0.2404611  0.1201589  0.25023261
## Edu.Exp   -0.7357026 -0.7035649  0.20608156
## Life.Exp  -0.8571684 -0.7291774  0.17008631
## GNI       -0.4951623 -0.5565621  0.08920818
## Mat.Mor    1.0000000  0.7586615 -0.08944000
## Ado.Birth  0.7586615  1.0000000 -0.07087810
## Parli.F   -0.0894400 -0.0708781  1.00000000
res1 <- cor.mtest(human, conf.level = .95)

cor(human) %>% corrplot(p.mat = res1$p, method = "color", type = "upper",
sig.level = c(.001, .01, .05), pch.cex = .9,
insig = "label_sig", pch.col = "white", order = "AOE")

Blue indicates positive correlation and red negative; lighter colors indicate weaker and darker colors stronger associations.

5.2. Principal Component Analysis (PCA) on the non-standardized dataset

Next we will perform a principal component analysis (PCA) on the non-standardized human data and show the variability captured by the principal components. A biplot will be drawn to display the observations on the first two principal components.

# create and print out a summary of pca_human
pca_human <- prcomp(human)
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
# rounded percentages of variance captured by each PC
pca_pr <- 100*round(s$importance[2, ], digits = 3)

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

5.3. PCA on the standardized dataset

Let us standardize the data and then perform the PCA again:

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# create and print out a summary of pca_human
pca_human <- prcomp(human_std)

s <- summary(pca_human)
s
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
##                            PC7     PC8
## Standard deviation     0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion  0.98702 1.00000
# rounded percentages of variance captured by each PC
pca_pr <- 100*round(s$importance[2, ], digits = 3)

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Standardizing the data makes a big difference: the results from the un-standardized and standardized data are very different. With un-standardized variables PC1 captured essentially all of the variance, evidently because GNI is measured on a much larger scale than the other variables and therefore dominates the total variance. With standardized variables PC1 captures 53.6% (PC2 16.2% and PC3 9.6%).

The variable Parli.F (the share of parliamentary seats held by women) has a strong positive correlation with Labo.FM (the ratio of female to male labour-force participation). The variable Ado.Birth (the adolescent birth rate) is positively correlated with PC1.

5.4 Personal interpretations

PCA reduces the dimensions of a dataset into components that explain the variance in the data. The arrows of the variables can be interpreted so that PC1 consists of Mat.Mor, Ado.Birth, Edu.Exp, Edu2.FM, Life.Exp and GNI, while PC2 consists of Parli.F and Labo.FM. PC1 explains 53.6% of the variance in the data and PC2 explains 16.2%.
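
The scale sensitivity discussed above can be demonstrated on a small synthetic dataset (the variable names here are illustrative, not from the human data): when one variable has a much larger scale than the others, it dominates the unscaled PCA, just as GNI did.

```r
# Toy demonstration of why scaling matters in PCA: a variable
# with a much larger scale dominates the unscaled components
set.seed(1)
toy <- data.frame(
  income = rnorm(100, mean = 20000, sd = 5000),  # large scale
  rate   = rnorm(100, mean = 3, sd = 1)          # small scale
)

# PCA on raw data: PC1 captures essentially all of the variance
prop_raw <- summary(prcomp(toy))$importance[2, ]

# PCA on standardized data: the variance is shared between the PCs
prop_std <- summary(prcomp(toy, scale. = TRUE))$importance[2, ]

round(prop_raw, 3)
round(prop_std, 3)
```

With the raw data PC1 is essentially the income axis; after standardization the two independent variables split the variance roughly evenly.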

5.5 Multiple Correspondence Analysis

The structure and the dimensions of the data:

I picked the variables “breakfast”, “evening”, “sex”, “sugar”, “where” and “lunch” to form my subset of the tea data. Here is some information about the structure of the data and some visualizations:

library(FactoMineR)
library(tidyr)
library(dplyr)
library(ggplot2)

data("tea")

tea_mca <- tea[, c("breakfast", "evening", "sex", "sugar", "where", "lunch")]

summary(tea_mca)
##          breakfast          evening    sex          sugar    
##  breakfast    :144   evening    :103   F:178   No.sugar:155  
##  Not.breakfast:156   Not.evening:197   M:122   sugar   :145  
##                                                              
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30
str(tea_mca)
## 'data.frame':    300 obs. of  6 variables:
##  $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ evening  : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ sex      : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ sugar    : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where    : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch    : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_mca) %>%
  ggplot(aes(value)) +
  facet_wrap("key", scales = "free") +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The dataset tea_mca consists of 300 observations of 6 variables.

MCA:

mca <- MCA(tea_mca, graph = FALSE)

summary(mca)
## 
## Call:
## MCA(X = tea_mca, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.216   0.198   0.178   0.171   0.154   0.132
## % of var.             18.552  16.987  15.288  14.679  13.222  11.317
## Cumulative % of var.  18.552  35.538  50.827  65.506  78.727  90.045
##                        Dim.7
## Variance               0.116
## % of var.              9.955
## Cumulative % of var. 100.000
## 
## Individuals (the 10 first)
##                         Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## 1                    |  0.408  0.257  0.206 | -0.315  0.167  0.122 |
## 2                    | -0.567  0.495  0.487 | -0.272  0.124  0.112 |
## 3                    | -0.095  0.014  0.010 |  0.194  0.063  0.043 |
## 4                    |  0.601  0.556  0.460 | -0.364  0.223  0.169 |
## 5                    |  0.203  0.063  0.040 |  0.089  0.013  0.008 |
## 6                    |  0.116  0.021  0.018 | -0.475  0.379  0.296 |
## 7                    | -0.077  0.009  0.007 | -0.426  0.305  0.230 |
## 8                    | -0.095  0.014  0.010 |  0.194  0.063  0.043 |
## 9                    | -0.371  0.212  0.117 | -0.221  0.082  0.042 |
## 10                   | -0.091  0.013  0.006 |  0.294  0.145  0.062 |
##                       Dim.3    ctr   cos2  
## 1                     0.447  0.373  0.246 |
## 2                     0.028  0.001  0.001 |
## 3                    -0.740  1.022  0.632 |
## 4                    -0.153  0.044  0.030 |
## 5                     0.218  0.089  0.046 |
## 6                    -0.214  0.086  0.060 |
## 7                     0.386  0.278  0.188 |
## 8                    -0.740  1.022  0.632 |
## 9                     0.712  0.948  0.434 |
## 10                    0.544  0.554  0.211 |
## 
## Categories (the 10 first)
##                          Dim.1     ctr    cos2  v.test     Dim.2     ctr
## breakfast            |  -0.279   2.880   0.072  -4.637 |   0.068   0.187
## Not.breakfast        |   0.258   2.659   0.072   4.637 |  -0.063   0.173
## evening              |   0.513   6.948   0.137   6.410 |   0.903  23.532
## Not.evening          |  -0.268   3.633   0.137  -6.410 |  -0.472  12.303
## F                    |  -0.557  14.151   0.452 -11.624 |   0.167   1.396
## M                    |   0.812  20.646   0.452  11.624 |  -0.244   2.036
## No.sugar             |  -0.654  17.043   0.458 -11.701 |  -0.143   0.891
## sugar                |   0.700  18.218   0.458  11.701 |   0.153   0.953
## chain store          |   0.160   1.259   0.045   3.686 |  -0.046   0.112
## chain store+tea shop |  -0.661   8.738   0.153  -6.771 |   0.502   5.513
##                         cos2  v.test     Dim.3     ctr    cos2  v.test  
## breakfast              0.004   1.131 |   0.790  28.019   0.577  13.131 |
## Not.breakfast          0.004  -1.131 |  -0.730  25.864   0.577 -13.131 |
## evening                0.426  11.287 |  -0.279   2.500   0.041  -3.490 |
## Not.evening            0.426 -11.287 |   0.146   1.307   0.041   3.490 |
## F                      0.041   3.493 |  -0.368   7.526   0.198  -7.695 |
## M                      0.041  -3.493 |   0.538  10.981   0.198   7.695 |
## No.sugar               0.022  -2.561 |  -0.075   0.270   0.006  -1.338 |
## sugar                  0.022   2.561 |   0.080   0.289   0.006   1.338 |
## chain store            0.004  -1.050 |  -0.327   6.398   0.190  -7.541 |
## chain store+tea shop   0.089   5.147 |   0.500   6.081   0.088   5.128 |
## 
## Categorical variables (eta2)
##                        Dim.1 Dim.2 Dim.3  
## breakfast            | 0.072 0.004 0.577 |
## evening              | 0.137 0.426 0.041 |
## sex                  | 0.452 0.041 0.198 |
## sugar                | 0.458 0.022 0.006 |
## where                | 0.178 0.170 0.196 |
## lunch                | 0.001 0.526 0.052 |

The summary of the MCA lists the explained variances: the first dimension explains 18.55% of the variance, the second 16.99% and the third 15.29%.

plot(mca, invisible=c("ind"), habillage="quali")


6: Analysis of longitudinal data

This week’s tasks focus on the analysis of longitudinal data. The analyses are conducted on two datasets, following the approaches presented in Chapters 8 and 9 of the book “Multivariate Analysis for the Behavioral Sciences”.

The datasets have been wrangled from wide form to long form before analysis. The R script file that created the data files can be viewed here.
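
The idea of the wide-to-long conversion can be sketched with a toy data frame (the names here are illustrative and not copied from the actual wrangling script): each measurement column in the wide form becomes one row per subject in the long form.

```r
library(tidyr)

# Toy wide data: one row per subject, one column per week
wide <- data.frame(
  subject = factor(1:3),
  week0 = c(45, 39, 50),
  week1 = c(43, 40, 48)
)

# Long form: one row per subject-week measurement
long <- gather(wide, key = weeks, value = bprs, week0:week1)
# extract the numeric week from the column name
long$week <- as.integer(substr(long$weeks, 5, 5))
```

After the conversion there are 3 subjects × 2 weeks = 6 rows, which is the same 40 × 9 = 360 logic as in BPRSL below.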

rm(list=ls())
setwd("~/Library/Mobile Documents/com~apple~CloudDocs/Documents/GitHub/IODS-project/data")

library(dplyr)
library(stringr)
library(ggplot2)
library(GGally)
library(tidyr)
library(readr)
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
## 
##     expand
RATSL <- read.table("RATSL.txt")
BPRSL <- read.table("BPRSL.txt")

RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)

6.1: Implement the analyses of chapter 8 of MABS using the RATS data

The RATS data consists of three groups of rats, 16 rats in total. The groups were put on different diets and the rats’ weights in grams were measured weekly for nine weeks. The dimensions (rows and columns) of the RATS data are:

## [1] 176   5

Let us visualize the weight increase of the rats by groups:

From the graph we can see that the rats’ weights increase in all groups over time. There is also one rat in group 2 that is much heavier than the others in that group. In groups 2 and 3 the weights seem to have increased more than in group 1. Let us standardize the values to make the groups easier to compare, and plot the standardized data:
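
The standardization is done separately at each measurement time. A minimal sketch on toy data, assuming columns named Time and Weight as in the wrangled RATSL data:

```r
library(dplyr)

# Toy long-form data with two time points
rats <- data.frame(
  ID = factor(rep(1:4, 2)),
  Time = rep(c(1, 8), each = 4),
  Weight = c(240, 250, 255, 260, 260, 275, 270, 290)
)

# Standardize within each time point: (value - mean) / sd of that time
rats_std <- rats %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
```

Within every time point the standardized weights then have mean 0 and standard deviation 1, so the groups' growth profiles can be compared on a common scale.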

There do not seem to be huge differences in weight change between the groups. The same information can be seen from a different visualization below, where the plot is built from group means and standard errors:

Here we can see that group 2 varies the most (remember that this group contains the one rat that is bigger than the others) and group 1 has the least variation. Let us remove that one outlier rat from group 2 and perform a summary measure analysis with boxplots:

All the groups have different means. Next we conduct an analysis of variance (ANOVA) to study whether the differences between the groups are statistically significant. The null hypothesis of this test is that the group means do not differ.
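
The test producing the table below can be run along these lines, sketched here on toy data (one assumed summary measure, the mean weight, per rat):

```r
# Sketch of the summary-measure ANOVA: one mean weight per rat,
# compared across diet groups (toy data, illustrative values)
set.seed(2)
toy <- data.frame(
  Group = factor(rep(1:3, each = 5)),
  meanWeight = c(rnorm(5, 260, 10), rnorm(5, 450, 10), rnorm(5, 530, 10))
)

fit <- lm(meanWeight ~ Group, data = toy)
anova(fit)  # F test of the null hypothesis of equal group means
```

The F statistic compares between-group to within-group variation; a small p-value leads to rejecting the hypothesis of equal means.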

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 188182   94091   580.4 7.07e-12 ***
## Residuals   11   1783     162                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA the p-value is close to zero, so we reject the null hypothesis and infer that the group means differ.

6.2: Implement the analyses of chapter 9 of MABS using the BPRS data

The BPRS data consists of 40 male subjects divided into two groups that received different treatments. All subjects were rated on the BPRS (brief psychiatric rating scale), which is used to evaluate symptoms of schizophrenia, weekly for eight weeks plus a baseline measurement. The dimensions of the dataset (rows and columns) are

## [1] 360   5

Here is a plot of the two groups showing changes among individuals through eight weeks:

This plot is not very informative: it does not tell much about the differences between the treatments, although treatment group 1 seems to show a more consistent decrease over time. Let us fit a basic linear regression to study the data further:
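
The call producing the output below has this form; it is sketched here on simulated stand-in data, since BPRSL itself is not re-created in this report:

```r
# bprs explained by week and treatment, ignoring that measurements
# on the same subject are correlated (simulated stand-in data)
set.seed(3)
sim <- data.frame(
  week = rep(0:8, times = 40),
  treatment = factor(rep(1:2, each = 180)),
  subject = factor(rep(1:40, each = 9))
)
sim$bprs <- 46 - 2.3 * sim$week + rnorm(nrow(sim), sd = 12)

BPRS_reg <- lm(bprs ~ week + treatment, data = sim)
summary(BPRS_reg)
```

Treatment group 1 is the reference level, so the `treatment2` coefficient estimates the difference of group 2 relative to it.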

## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

The results indicate that, according to this model, the treatment is not statistically significantly associated with the development of the BPRS values, whereas week does have a statistically significant effect. Next let us fit a linear mixed effects model that takes into account the fact that our observations are not independent.
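
A sketch of the random-intercept fit with lme4, again on simulated stand-in data: the term `(1 | subject)` gives each subject their own intercept, and `REML = FALSE` (maximum likelihood) allows nested models to be compared with a likelihood-ratio test.

```r
library(lme4)

# Random-intercept model: fixed effects for week and treatment,
# plus a subject-specific intercept (simulated stand-in data)
set.seed(4)
sim <- data.frame(
  week = rep(0:8, times = 40),
  treatment = factor(rep(1:2, each = 180)),
  subject = factor(rep(1:40, each = 9))
)
sim$bprs <- 46 - 2.3 * sim$week +
  rep(rnorm(40, sd = 7), each = 9) +   # subject-level shifts
  rnorm(nrow(sim), sd = 10)

BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject),
                 data = sim, REML = FALSE)
summary(BPRS_ref)
```

The summary splits the variance into a between-subject part (the random intercept) and a residual part, which is exactly the structure shown in the output below.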

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

In this model we can see that the standard errors of week and treatment are smaller than in the previous model. Let us test which model is better by comparing them with ANOVA:
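
The mechanics of that comparison: `anova()` applied to two nested models fitted to the same data tests whether the richer model improves the fit (for lmer fits under ML this is the chi-squared likelihood-ratio test shown below). The same call is illustrated here with plain lm fits on toy data:

```r
# Likelihood-ratio style comparison of nested models (toy data)
set.seed(5)
d <- data.frame(x = runif(100), g = factor(rep(1:2, 50)))
d$y <- 2 * d$x + as.numeric(d$g) + rnorm(100, sd = 0.5)

m0 <- lm(y ~ x, data = d)       # reduced model
m1 <- lm(y ~ x + g, data = d)   # full model
anova(m0, m1)                   # test of the extra term
```

A small p-value means the extra structure (here the group term; in the BPRS case the random intercept) is worth keeping.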

## Data: BPRSL
## Models:
## BPRS_reg: bprs ~ week + treatment
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
##          Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## BPRS_reg  4 2837.3 2852.9 -1414.7   2829.3                             
## BPRS_ref  5 2748.7 2768.1 -1369.4   2738.7 90.624      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA comparison indicates that the linear mixed effects model is the better of the two: the chi-squared test is highly significant and its AIC is lower.

Let us plot the model and compare it with original data:

We can see from the plot that the model gives a fairly good representation of the treatment groups: group 1 shows a clearer decrease in BPRS values in the fitted version, just as in the original data.